Connecting SOC with RL – Importance sampling

Alonso Cisneros

Zuse Institute Berlin

Outline

  1. Crash course on RL

  2. What is importance sampling

  • The connection to optimization
  • Optimal biasing
  3. Optimal biasing as an RL problem

  4. The things I’d like to connect

Crash course on Reinforcement Learning

A Miniopoly board

  • The game has a state at turn \(t\) denoted \(s_t\)
  • At a turn \(t\) players roll the dice
  • The change in money after buying/paying rent/charging rent is recorded as a reward \(r_t\)

Important

We train our robot to maximize the cumulative reward as it takes actions and explores the space of states
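To make this setup concrete, here is a toy rollout loop; the board size, dice, and money changes below are made-up placeholders, not rules from the slides:

```python
import random

def play_turn(state, rng):
    """One hypothetical Miniopoly turn: roll two dice, move around a
    40-square board, and receive a toy reward (change in money)."""
    roll = rng.randint(1, 6) + rng.randint(1, 6)
    new_state = (state + roll) % 40
    reward = rng.choice([-100, -50, 0, 50])  # made-up buy/pay/charge outcome
    return new_state, reward

def play_episode(n_turns=20, seed=0):
    """Roll out one game and return the cumulative reward."""
    rng = random.Random(seed)
    state, total = 0, 0
    for _ in range(n_turns):
        state, r = play_turn(state, rng)
        total += r
    return total

print(play_episode())
```

The robot here acts blindly; the point is only that an episode produces a sequence of states, actions, and rewards whose sum we try to maximize.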

Dynamics of the space

What if we don’t know how square transitions work?

So far, we computed the transition probabilities using our knowledge of the dice

Markov Chain Monte Carlo

  • We let the robot roam around and buy squares as it pleases
    • For any square, it can either buy it or not
  • At the end of the game, we record what it gained or lost from buying or not buying each square.

Importance Sampling

  • We wanted to compute the expected reward of the robot after the entire game
  • MCMC often fails in metastable systems
  • Importance sampling aims to remedy this

Important

The general idea of importance sampling is to draw random variables from a different probability measure and then weight them back, so that the estimator of the desired quantity of interest remains unbiased.
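A minimal, self-contained illustration of this idea on a toy rare event (estimating \(P(X > 4)\) for a standard Gaussian by sampling from a shifted proposal; the setup is mine, not from the slides):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
n = 100_000

# Target: P(X > 4) under p = N(0, 1) -- a rare event, so plain Monte
# Carlo from p almost never sees a sample in the region of interest.
# Proposal: q = N(4, 1), centred on the rare region.
x = rng.normal(4.0, 1.0, n)                             # draw from q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 4.0, 1.0)   # weight back to p
estimate = np.mean((x > 4.0) * w)                       # unbiased for E_p[1{X>4}]

print(estimate)  # true value is about 3.17e-5
```

Sampling from the shifted proposal places roughly half the samples in the rare region, and the likelihood-ratio weights \(p/q\) restore unbiasedness.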

More formally…

  • The metastable stochastic system we sample from follows Langevin dynamics \[\begin{equation} \mathrm{d}X_s = - \nabla V(X_s) \, \mathrm{d}s + \sigma(X_s) \, \mathrm{d}W_s \end{equation}\]
  • We want to hit a target set \(\mathcal{T}\). We define \[\begin{equation} \tau = \inf \{ s > 0 \mid X_s \in \mathcal{T} \} \end{equation}\]
  • We’re interested in computing \(I: C([0, \infty), \mathbb{R}^d) \to \mathbb{R}\) \[\begin{equation} I(X) \coloneqq \exp(- \mathcal{W}(X)) \end{equation}\]

Our main goal is to compute \[ \Psi(x) \coloneqq \mathbb{E}^x [I(X)] \coloneqq \mathbb{E}[I(X) \mid X_0 = x] \]
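A crude Monte Carlo sketch of this quantity, assuming a 1D double-well potential \(V(x) = (x^2 - 1)^2\), constant \(\sigma = 1\), target set \(\mathcal{T} = [1, \infty)\), and \(\mathcal{W}(X) = \tau\) (all toy choices for illustration, not from the slides):

```python
import numpy as np

def psi_mc(x0=-1.0, n_paths=500, dt=1e-2, t_max=100.0, seed=0):
    """Crude Monte Carlo estimate of Psi(x) = E^x[exp(-tau)] for the
    Euler-Maruyama discretization of dX = -V'(X) ds + dW, with the
    double-well V(x) = (x^2 - 1)^2 and target set T = [1, inf)."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x ** 2 - 1.0)
    total = 0.0
    for _ in range(n_paths):
        x, t = x0, 0.0
        while x < 1.0 and t < t_max:   # stop on hitting T (or at a cutoff)
            x += -grad_V(x) * dt + np.sqrt(dt) * rng.normal()
            t += dt
        total += np.exp(-t)            # I(X) = exp(-tau), taking W(X) = tau
    return total / n_paths

print(psi_mc())
```

Starting in the left well, most trajectories take a long time to cross the barrier, so \(\exp(-\tau)\) is tiny for most samples: exactly the metastability problem the next slide complains about.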

But…

  • MCMC has terrible properties because of metastability
  • Closed-form expressions for \(\Psi(x)\) typically don’t exist

Tip

  • We can “push” the particle by adding a force, as long as we account for the bias this introduces and correct for it
  • That “push” is achieved by adding a control \(u\).

The new, controlled dynamics are now described as \[\begin{align*} \label{eq: controlled langevin sde} \mathrm dX_s^u &= (-\nabla V(X_s^u) + \sigma(X_s^u) \, u(X_s^u))\mathrm ds + \sigma(X_s^u) \mathrm dW_s, \\ X_0^u &= x \end{align*}\]

Via Girsanov’s theorem, we can relate our quantity of interest to the original expectation: \[\begin{equation} \label{eq: expectation IS} \mathbb{E}^x\left[I(X)\right] = \mathbb{E}^x\left[I(X^u) M^u\right], \end{equation}\]

where the exponential martingale \[\begin{equation} \label{eq: girsanov martingale} M^u \coloneqq \exp{\left(- \int_0^{\tau^u} u(X_s^u) \cdot \mathrm dW_s - \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right)} \end{equation}\] corrects for the bias the pushing introduces.
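A sketch of the reweighted estimator \(I(X^u)\,M^u\) for a toy setup (1D double-well \(V(x) = (x^2 - 1)^2\), \(\sigma = 1\), \(\mathcal{W}(X) = \tau\), target set \([1, \infty)\), and a crude constant control; all of these are illustrative assumptions, not choices from the slides):

```python
import numpy as np

def is_estimate(x0=-1.0, push=1.5, n_paths=500, dt=1e-2, t_max=100.0, seed=0):
    """Importance-sampling estimate of E^x[exp(-tau)] using the controlled
    dynamics dX = (-V'(X) + u) dt + dW with a crude constant push u = push,
    reweighted by the discretized Girsanov martingale M^u."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x ** 2 - 1.0)
    total = 0.0
    for _ in range(n_paths):
        x, t, log_M = x0, 0.0, 0.0
        while x < 1.0 and t < t_max:
            dW = np.sqrt(dt) * rng.normal()
            x += (-grad_V(x) + push) * dt + dW
            # Accumulate log M^u = -int u dW - (1/2) int |u|^2 ds
            log_M += -push * dW - 0.5 * push ** 2 * dt
            t += dt
        total += np.exp(-t) * np.exp(log_M)   # I(X^u) * M^u
    return total / n_paths

print(is_estimate())
```

The push makes trajectories cross the barrier quickly, while the accumulated \(\log M^u\) corrects for the extra force; how good a constant push is, in variance terms, is exactly what the next slide addresses.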

Important

The previous relationship always holds. But the variance of the estimator depends heavily on the choice of \(u\).

Clearly, we aim to achieve the smallest possible variance through an optimal control \(u^*\) \[\begin{equation} \label{eq: variance minimization} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = \inf_{u \in \mathcal{U}} \left\{ \operatorname{Var} (I(X^u) M^u) \right\} \end{equation}\]

Connection to optimization

It turns out (Quer and Ribera Borrell 2024) that the problem of minimizing the variance corresponds to a problem in optimal control

The cost functional \(J\) to find the variance-minimizing control is \[\begin{equation} \label{eq: cost functional} J(u; x) \coloneqq \mathbb{E}^x\left[\mathcal{W}(X^u) + \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right], \end{equation}\]

With this formulation, the value function \(\Phi(x) \coloneqq -\log \Psi(x)\) satisfies \[\begin{equation} \Phi(x) = \inf_{u \in \mathcal{U}} J(u; x). \end{equation}\]

Important

The optimal bias achieves zero variance: \[\begin{equation} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = 0. \end{equation}\]
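Why zero variance is even attainable follows from the logarithmic transformation linking \(\Phi\) and \(\Psi\); the explicit formula below is standard in this literature, though not stated on the slides above:

\[\begin{equation} u^*(x) = -\sigma(x)^\top \nabla \Phi(x) = \sigma(x)^\top \nabla \log \Psi(x). \end{equation}\]

Substituting \(u^*\) into the Girsanov identity makes the product \(I(X^{u^*}) M^{u^*}\) almost surely equal to the constant \(\Psi(x)\), so every sample of the reweighted estimator already equals the quantity of interest.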

Optimal biasing through RL

  • Let’s reconsider the SOC problem (excuse the change in notation)
  • We discretize the dynamics with the Euler–Maruyama scheme \[\begin{align*} s_{t+1} &= s_t + \left( -\nabla V(s_t) + \sigma u(s_t)\right) \Delta t + \sigma \, \sqrt{\Delta t} \, \eta_{t+1} \\ s_0 &= x \end{align*}\] where the \(\eta_{t+1} \sim \mathcal{N}(0, \mathrm{Id})\) are i.i.d. standard Gaussian increments

The time-discretized objective function is given by \[\begin{equation} \small J(u; x) \coloneqq \mathbb{E}^{x} \left[ g(s_{T_u}) + \sum_{t=0}^{T_u - 1} f(s_t) \Delta t + \frac{1}{2} \sum_{t=0}^{T_u - 1} |u(s_t)|^2 \Delta t \right] \end{equation}\] where \(f\) and \(g\) are the running and terminal costs making up \(\mathcal{W}\), i.e. \(\mathcal{W}(X) = \int_0^{\tau} f(X_s) \, \mathrm ds + g(X_\tau)\).

  • Our stopping time \(\tau\) is now denoted \(T_u\)

Some formalities

  • The state space \(\mathcal{S}\) is all possible \(s \in \mathbb{R}^d\)
  • The action space \(\mathcal{A}\) is the codomain of all possible controls \(\mathbb{R}^d\)
  • The stopping time \(T_u\) for the controlled process is a.s. finite
  • We’ll approximate the control with Galerkin projections \(u_\theta\)
  • We still need to derive the transition probability and reward functions
  • The return we want to optimize depends on a reward function \[ r_t = r(s_t, a_t) \coloneqq \begin{cases} - f(s_t) \Delta t - \frac{1}{2} |a_t|^2 \Delta t & \text{if} \; s_t \notin \mathcal{T} \\ -g(s_t) & \text{if} \; s_t \in \mathcal{T}. \end{cases} \]
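The reward above translates directly into code; a minimal sketch, where the callables `in_target`, `f`, and `g` are placeholder names I introduce here:

```python
import numpy as np

def reward(s, a, in_target, f, g, dt):
    """Reward from the slides: running cost -f(s)*dt - 0.5*|a|^2*dt while
    off the target set, terminal reward -g(s) once the target is hit."""
    if in_target(s):
        return -g(s)
    return -f(s) * dt - 0.5 * float(np.dot(a, a)) * dt

# Hypothetical usage with f = 1, g = 0, so the return accumulates -tau:
r = reward(np.array([0.5]), np.array([0.3]),
           in_target=lambda s: s[0] >= 1.0,
           f=lambda s: 1.0, g=lambda s: 0.0, dt=0.01)
print(r)  # approximately -0.01045
```

With this reward, maximizing the expected return is the same as minimizing the time-discretized cost functional \(J\) from the previous slide.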

What I want to do

Two possibilities

  1. Take the SOC problem already published in (Helfmann et al. 2023) and pose the right cost functional (the “easy” one)
  2. Go back to the MFE and then
     • Break the assumption of a fully connected agent–agent network
     • Find the MFE without that assumption
     • Pose a SOC problem
     • Find the right cost functional

What I’m trying at the moment

  • Approach it from the MFE side
    • Looking for generalizations of McKean-Vlasov PDEs with time-dependent networks
    • Thinking what graphon I can get from the network
  • Finding the right cost functional on the “easy” case

Bonus: Assortativity & network topology

References

Helfmann, Luzie, Nataša Djurdjevac Conrad, Philipp Lorenz-Spreen, and Christof Schütte. 2023. “Modelling Opinion Dynamics Under the Impact of Influencer and Media Strategies.” Scientific Reports 13 (1): 19375. https://doi.org/10.1038/s41598-023-46187-9.
Quer, Jannes, and Enric Ribera Borrell. 2024. “Connecting Stochastic Optimal Control and Reinforcement Learning.” Journal of Mathematical Physics 65. https://doi.org/10.1063/5.0140665.
Ribera Borrell, Enric, Jannes Quer, Lorenz Richter, and Christof Schütte. 2024. “Improving Control Based Importance Sampling Strategies for Metastable Diffusions via Adapted Metadynamics.” SIAM Journal on Scientific Computing 46 (2): S298–323. https://doi.org/10.1137/22M1503464.